Avoid half-deleted cache trees on concurrent pip-compile --rebuild#2384
Avoid half-deleted cache trees on concurrent pip-compile --rebuild#2384gaborbernat wants to merge 2 commits into
Conversation
webknjaz
left a comment
There was a problem hiding this comment.
Not sure how I feel about adding more mocks. The general preference is having more integration tests..
ddd020c to
9eb0755
Compare
|
You're right, the PR title oversells what the change actually delivers. The atomic rename only collapses the specific window where The wider concurrent- Three options I see, happy to take direction:
I lean toward (1) because it keeps the PR small and reviewable while still fixing a real (if narrower) symptom; (3) is the right end-state but a meaningfully larger surface. Which would you prefer? |
96a590d to
7f6e67f
Compare
I fully agree. Let's clarify that this is not a full fix for the identified issue, and we can go after a complete solution in the future. |
7f6e67f to
d117ea7
Compare
d117ea7 to
5820e76
Compare
5820e76 to
2a6030e
Compare
webknjaz
left a comment
There was a problem hiding this comment.
The overall shape looks good, but I've got a few more suggestions and requests below.
| @@ -0,0 +1,6 @@ | |||
| Atomically rename the download directory before deleting it in | |||
There was a problem hiding this comment.
Instead of using imperative mood, rephrase this to use a comparison with the previous release (X now does Y) or the past tense (X was done/changed to do Y). Imperative mood is good for commit messages but in a change log, it makes it look like a TODO-list, which is not at all what we want.
| try: | ||
| os.replace(self._download_dir, stale) | ||
| except OSError: | ||
| return |
There was a problem hiding this comment.
This probably needs to be logged in the very verbose mode. Especially since any arbitrary OSError is suppressed w/o checking for specific error codes that may change over time.
cc @sirosen WDYT?
| # around the resolve, which is out of scope here. | ||
| if not os.path.exists(self._download_dir): | ||
| return | ||
| stale = f"{self._download_dir}.stale-{os.getpid()}" |
There was a problem hiding this comment.
Can we use a more specific var name? stale is quite generic. old_downloads_dir, perhaps?
|
|
||
| @pytest.fixture | ||
| def repo_with_cache_dir(tmp_path: Path) -> PyPIRepository: | ||
| return PyPIRepository([], cache_dir=str(tmp_path / "cache")) |
There was a problem hiding this comment.
Side note: cache_dir could've accepted pathlib objects already if we were to just relax typing in the initializer… An idea for a follow-up.
| ) -> None: | ||
| Path(repo_with_cache_dir._download_dir).mkdir(parents=True, exist_ok=True) | ||
| mocker.patch( | ||
| "piptools.repositories.pypi.os.replace", |
There was a problem hiding this comment.
Since you're already integrating pytest-mock, perhaps also apply a mocker.spy() on top, to confirm the args being passed?
| [project.optional-dependencies] | ||
| testing = [ | ||
| "pytest >= 7.2.0", | ||
| "pytest-mock >= 3.15.1", |
There was a problem hiding this comment.
Is this version really the lowest compatible? Let's find out which one is necessary and set it to that. Or drop the lower bound if any version works.
There was a problem hiding this comment.
This is the only version we tested agaisnt and know to work, unless we have a CI that runs against lowest version we're really just hoping that those low relaxed versions work.
There was a problem hiding this comment.
Yeah, but the lower boundaries should reflect APIs relied upon (or known broken things). Without such knowledge, we shouldn't assume that something won't work on older versions, this would only cause problems for downstream packagers for no good reason. After all, the depresolvers would optimize for latest releases anyway.
On a related note, we should look into switching from extras to depgroups once that's supported (I know tox does support these, but I wanted to wait until pip-tools itself does too).
There was a problem hiding this comment.
BTW I did lower it to 3.3.0 that's the first compatible version.
There was a problem hiding this comment.
That's too big to consider at the moment + I was hoping #2175 would go in.
There was a problem hiding this comment.
😊 one can dream and hope 😊
|
Tip @sirosen did you know you can use mermaid on GH natively? https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/creating-diagrams#creating-mermaid-diagrams |
254ea5b to
dccaadd
Compare
Two ``pip-compile --rebuild`` invocations sharing a ``cache_dir`` race inside ``PyPIRepository.clear_caches``. Both call ``shutil.rmtree(self._download_dir, ignore_errors=True)`` on the same path, and the second resolver walks a tree the first started removing. ``ignore_errors=True`` masks the error but not the corruption. The patch swaps the in-place delete for an atomic ``os.replace`` to a per-PID sibling directory followed by ``rmtree`` of the renamed copy. ``os.replace`` is atomic on POSIX and NTFS, so a concurrent peer sees either the old directory or no directory, never a half-deleted tree. Scope is intentionally narrow: this fixes one observable symptom of jazzband#2393, not the wider race where a process is already using the cache and a peer ``--rebuild`` later renames the directory out from under it. Full concurrent-rebuild safety needs locking around the resolve and is left for future work.
dccaadd to
f129240
Compare
f129240 to
e05cb04
Compare
| old_downloads_dir = f"{self._download_dir}.stale-{os.getpid()}" | ||
| try: | ||
| os.replace(self._download_dir, old_downloads_dir) | ||
| except OSError as exc: |
There was a problem hiding this comment.
Could you use an os_err instead of a generic var name?
| # move (cross-device, permissions). The directory is no longer | ||
| # ours to clear; log at -v since the suppressed OSError is | ||
| # otherwise invisible and its error codes shift across platforms. | ||
| log.debug(f"clear_caches skipped {self._download_dir!r}: {exc}") |
There was a problem hiding this comment.
Should this be
| log.debug(f"clear_caches skipped {self._download_dir!r}: {exc}") | |
| log.debug(f"clear_caches skipped {self._download_dir!s}: {os_err}") |
? We probably don't need to call that repr interface.
e05cb04 to
828dace
Compare
| assert "file:small_fake_with_deps-0.1-py2.py3-none-any.whl" in out.stderr | ||
|
|
||
|
|
||
| @pytest.mark.skipif( |
There was a problem hiding this comment.
Is this directly related to the patch? How? Why do you think it's deterministic?
There was a problem hiding this comment.
https://github.com/jazzband/pip-tools/actions/runs/25870600925/job/76024490776#step:9:1111 proves that this test change is not related to this PR as it was failing on main before. Meaning this shouldn't be included in the scope.
828dace to
87816c9
Compare
| testing = [ | ||
| "pytest >= 7.2.0", | ||
| "pytest-mock >= 3.15.1", | ||
| "pytest-mock >= 3.3.0", |
There was a problem hiding this comment.
But is needed for this PR. I don't know man this feels to me like we're being pedantic and generating lots of new work for negligable benefit 🤷
There was a problem hiding this comment.
A good Git hygiene means being able to git revert things w/o additional edits by having atomic changes bundled. The addition of that dep with a wrong version spec slipped into main from another PR, so this PR doesn't need to change it, but the change still needs to happen separately.
I guess we could keep it in this PR if it moves to an atomic commit within the branch.
There was a problem hiding this comment.
While I can sympathize with the sentiment here, at the same time there is a trade-off to be made about chasing perfection and balancing the return on the time investment we are asking of people. My personal stance is that we should aim for good enough. Not everything needs to be a revert, and often, when something goes wrong in my experience, it's easier to just follow it up with the pull request that either fixes that issue or reverses the part of a bigger PR that didn't work out. This very easily could be just a difference of style when contributing and maintaining projects between us.
There was a problem hiding this comment.
I understand, it's just not good enough. Good shape of Git is important to me. Otherwise, we'd be allowing fake squash merges and other questionable approaches.
87816c9 to
9f09086
Compare
|
The spy question stands. Also waiting for Stephen's feedback on questions. Perhaps, this should be marked as a draft while being constantly force-pushed? |
It's ready from POV, I'm just addressing your constant feedback, hence constant force-push. |
|
FYI, it's a good tone to follow https://hynek.me/til/easier-crediting-contributors-github/. |
webknjaz's review left several requests on the parallel-rebuild fix.
The renamed-aside directory variable moves from the generic `stale` to
`old_downloads_dir` so the name says what it holds. The suppressed
`OSError` now logs at `-v` rather than vanishing: a racing peer or an
OS-level rename refusal is worth a breadcrumb, and the error codes that
trigger it shift across platforms, so silently swallowing it hides a
real signal.
The changelog fragment moves out of imperative mood into the "X now
does Y" form the towncrier convention expects, routes the second
`pip-compile --rebuild` mention through the `{command}` role, and gains
the `by` keyword in the byline. `changelog.d/2384.bugfix.md` is a
symlink to `2393.bugfix.md` so the note renders under both the issue
and the PR.
The `pytest-mock` lower bound drops to match the unbounded style of the
sibling test dependencies (`pytest-rerunfailures`, `pytest-xdist`,
`tomli-w`); pinning one test helper to a single tested version while
the rest float was inconsistent. `test_clear_caches_swallows_replace_
failure` now asserts the rename was attempted against the live download
directory before the OSError short-circuits the rmtree.
9f09086 to
b58c4bb
Compare
|
This was almost ready, I'd rather see it completed (maybe by my or Stephen). |

Two
pip-compile --rebuildinvocations sharing acache_dirrace insidePyPIRepository.clear_caches. Both callshutil.rmtree(self._download_dir, ignore_errors=True)on the same path, and the second resolver walks a tree the first started removing.ignore_errors=Truemasks the error but not the corruption.The patch swaps the in-place delete for an atomic
os.replaceto a per-PID sibling directory, then deletes the renamed copy.os.replaceis atomic on POSIX and NTFS, so a concurrent peer sees the original directory or no directory, never a half-deleted tree.Scope is intentionally narrow per discussion with @sirosen below: this fixes one observable symptom of #2393, not the wider race where a process is already using the cache and a peer
--rebuildlater renames the directory out from under it. Full concurrent-rebuild safety needs locking around the resolve and is left for future work.Refs #2393.